Abstract

This paper shows that code expanding optimizations have strong and non-intuitive
implications for instruction cache design. Three types of code expanding optimizations are studied
in this paper: instruction placement, function inline expansion, and superscalar optimizations.
Overall, instruction placement reduces the miss ratio of small caches. Function inline expansion
improves the performance for small cache sizes, but degrades the performance of medium caches.
Superscalar optimizations increase the cache size required for a given miss ratio. On the other
hand, they also increase the sequentiality of instruction access so that a simple load-forward
scheme effectively cancels the negative effects. Overall, we show that with load forwarding, the
three types of code expanding optimizations jointly improve the performance of small caches
and have little effect on large caches.

Index terms- C compiler, code optimization, cache memory, code expansion, load forwarding,

The authors are with the Center for Reliable and High-Performance Computing, University of Illinois,
Urbana-Champaign, Illinois, 61801.


1 Introduction 


Compiler technology plays an important role in enhancing the performance of processors. Many
code optimizations are incorporated into a compiler to produce code that is comparable to or better
than hand-written machine code. Classic code optimizations decrease the number of executed
instructions [1]. However, there are factors limiting the effectiveness of these optimizations. For
example, small function bodies limit the scope of optimization and scheduling. To increase the
scope of code optimization, inline function expansion is performed by many compilers [2] [3] [4].
Function inlining replaces a function call with the function body. To further enlarge the scope of
code optimization and scheduling, compilers unroll loops by duplicating the loop body several times.

The IMPACT-I C compiler utilizes inline expansion, loop unrolling, and other code optimization
techniques. These techniques increase the execution efficiency at the cost of increasing the overall
code size. Therefore, these compiler optimizations can affect the instruction cache performance.

This paper examines the effect of these code expanding optimizations on the performance of a
wide range of instruction cache configurations. The experimental data indicate that code expanding
optimizations have strong and non-intuitive implications for instruction cache design. For small
cache sizes, the overall cache miss ratio of the expanded code is lower than that of the code
without expansion. The opposite is true for large cache sizes. This paper studies three types of
code expanding optimizations: instruction placement, function inline expansion, and superscalar
optimizations. Overall, instruction placement increases the performance of small caches. Function
inline expansion improves the performance of small caches, but degrades that of medium caches.
Superscalar optimizations increase the cache size required for a given miss ratio. However, they
also increase the sequentiality of instruction access so that a simple load-forward scheme removes
the performance degradation. Overall, it is shown that with load forwarding, the three types of
code expanding optimizations jointly improve the performance of small caches and have little effect
on large caches.


1.1 Related Work 


Cache memory is a popular and familiar concept. Smith studied cache design tradeoffs extensively
with trace driven simulations [5]. In his work, many aspects of the design alternatives that can affect
cache performance were measured. Later, both Smith and Hill focused on specific cache design
parameters. Smith studied the cache block (line) size design and its effect on a range of machine
architectures, and found that the miss ratios for different block sizes can be predicted regardless of
the workload used [6]. The causes of cache misses were categorized by Hill and Smith into three
types: conflict misses, capacity misses, and compulsory misses [7]. The loop model was introduced
by Smith and Goodman to study the effect of replacement policies and cache organizations [8].
They showed that under some circumstances, a small direct mapped cache performs better than
the same cache using full associativity with an LRU replacement policy. The tradeoffs between a
variety of cache types and on-chip registers were reported by Eickemeyer and Patel [9]. This
work showed that when the chip area is limited, a small- or medium-sized instruction cache is
the most cost-effective way of improving processor performance. Przybylski et al. studied the
interaction of cache size, block size, and associativity with respect to the CPU cycle time and the
main memory speed [10]. This work found that cache size and cycle time are dependent design
parameters. Alpert and Flynn introduced a utilization model to evaluate the effect of the block
size on cache performance [11]. They considered the actual physical area of caches and found that
larger block sizes have a better cost-performance ratio. All of these studies assumed an invariant
compiler technology and did not consider the effects of compiler optimizations on instruction
cache performance.

Load forwarding is used to reduce the penalty of a cache miss by overlapping the cache repair
with the instruction fetch. Hill and Smith evaluated the effects of load forwarding for different
cache configurations [12]. They concluded that load forwarding in combination with prefetching
and sub-blocking increases the performance of caches. In this paper a simpler version of the
load-forward scheme is used, where neither prefetching nor sub-blocking is performed. The effectiveness
of the load-forward technique is measured by comparing the cache performance of code without
optimizations and with code expanding optimizations. Load forwarding potentially can hide the
effects of code expanding optimizations.

Davidson and Vaughan compared the cache performances of three architectures with different
instruction set complexities [13]. They have shown that less dense instruction sets consistently
generate more memory traffic. The effect of the instruction sets of over 50 architectures on cache
performance has been characterized by Mitchell and Flynn [14]. They showed that intermediate
cache sizes are not suited for less dense architectures. Steenkiste [15] was concerned with the
relationship between the code density pertaining to instruction encoding and instruction cache
performance. He presented a method to predict the performance of different architectures based on
the miss rate of one architecture. Unlike less dense instruction sets, which typically have higher miss
ratios for caches [13], we show that code expansion due to optimizations improves the performance
of small caches, and degrades that of large caches. Our approach is also different from these previous
studies in that the instruction set is kept constant. A load/store RISC instruction set whose code
density is close to that of the MIPS R2000 instruction set is assumed.

Duderman and Flynn have simulated the effects of classic code optimizations on architecture
design decisions [16]. Classic code optimizations do not significantly alter the actual working sets
of programs. In contrast, in this paper, classic code optimizations are always performed; code
expanding optimizations that enlarge the working sets are the major concern. Code expanding
optimizations increase the actual code size and change the instruction sequential and spatial
localities.

1.2 Outline Of This Paper 

Section 2 describes the instruction cache design parameters and the performance metrics. The 
cache performance is explained using the recurrence/conflict model [17]. Section 3 describes the 
code expanding optimizations and their effects on the target code and the cache design. Section 4 
presents and analyzes experimental results. Section 5 provides some concluding remarks. 


2 Instruction Cache Design Parameters 

2.1 Performance Metrics with Recurrences and Conflicts 


The dimension of a cache is expressed by three parameters: the cache size, the block size, and the
associativity of the cache [5]. The size of the cache, $2^C$, is defined by the number of bytes that can
simultaneously reside in the cache memory. The cache is divided into $b$ blocks, and the block size,
$2^B$, is the cache size divided by $b$. The associativity of a cache is the number of cache blocks that
share the same cache set. An associativity of one is commonly called a direct mapped cache, and
an associativity of $2^{C-B}$ defines a fully associative cache.

The metric used in many cache memory system studies is the cache miss ratio. This is the
number of references that are not satisfied by a cache at a level of the memory system
hierarchy over the total number of references made at that cache level. The miss ratio has served as
a good metric for memory systems since it is characteristic of the workload (e.g., the memory trace)
yet independent of the access time of the memory elements. Therefore, a given miss ratio can be
used to decide whether a potential memory element technology will meet the required bandwidth
for the memory system.

The recurrence/conflict model [17] of the miss ratio will be used to analyze the cause of cache
misses. Consider the trace in Figure 1. $a_1$, $a_2$, $a_3$, and $a_4$ are the first occurrences of an access, and
they are unique in the trace. The recurrences in the trace are accesses $a_5$, $a_6$, $a_7$, and $a_8$. Without a
context switch, all four of these recurrences would result in a hit in an infinite cache. In the ideal case
of an infinite cache and in the absence of context switching, the intrinsic miss ratio is expressed
as

$$\rho_0 = \frac{N - R}{N} \qquad (1)$$

where $R$ is the total number of recurrences and $N$ is the total number of references. Note that
an access can be of only two types: either a unique or a recurrent access. Non-ideal behavior
occurs due to conflicts, and this paper considers only the dimensional conflicts; multiprogramming
conflicts are considered in [18].
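The unique/recurrent split and Equation 1 can be illustrated with a short sketch. The function names are hypothetical, and the address sequence is an assumption: it is the eight-reference trace of Figure 1 as it can be read from the surrounding text (addresses 0, 1, 2, 3 followed by 1, 2, 1, 2).

```python
def classify_trace(addresses):
    """Count unique (first-time) and recurrent accesses in a reference trace."""
    seen = set()
    unique = recurrent = 0
    for addr in addresses:
        if addr in seen:
            recurrent += 1      # address was referenced before: a recurrence
        else:
            seen.add(addr)
            unique += 1         # first reference to this address
    return unique, recurrent


def intrinsic_miss_ratio(addresses):
    """Equation 1: (N - R) / N for an infinite cache with no context switches."""
    n = len(addresses)
    _, r = classify_trace(addresses)
    return (n - r) / n


# Assumed address sequence for references a_1 .. a_8 (see lead-in).
TRACE = [0, 1, 2, 3, 1, 2, 1, 2]
```

For this trace, N = 8 and R = 4, so only the four first-time references miss in an infinite cache.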

Reference   a_1  a_2  a_3  a_4  a_5  a_6  a_7  a_8
Address      0    1    2    3    1    2    1    2

Figure 1: An example trace of addresses.

Figure 2: An example two-block direct-mapped cache behavior (an asterisk marks a dimensional conflict).

A dimensional conflict is defined as an event which converts a recurrent access into a miss
due to limited cache capacity or mapping inflexibility. For illustration, consider a direct mapped
cache composed of two one-byte blocks as shown in Figure 2. A miss occurs for recurrent access $a_5$
because reference $a_4$ purges address 1 from the cache due to insufficient cache capacity. Hence, $a_4$
represents a dimensional conflict for the recurrence $a_5$. The other misses, $a_1$, $a_2$, $a_3$, and $a_4$, occur
because these are the first references to addresses 0, 1, 2, and 3, respectively (i.e., they are unique
accesses). Therefore, the following formula can be used for deriving the cache miss ratio, $\rho$, for a
given trace and a given cache dimension.

$$\rho = \rho_0 + \frac{C_D}{N} = \frac{N - (R - C_D)}{N} \qquad (2)$$

where $C_D$ is the total number of dimensional conflicts, and $\rho_0$ is the intrinsic miss ratio.
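As a worked check of Equation 2, the sketch below replays a trace through a direct mapped cache and counts dimensional conflicts. It is a minimal illustration, not the simulator used in this paper; the function name is hypothetical, the trace is the address sequence assumed for Figure 1, and recurrences are tracked per block (which equals per address for one-byte blocks).

```python
def direct_mapped_misses(addresses, num_blocks, block_size=1):
    """Return (total misses, dimensional conflicts) for a direct mapped cache."""
    frames = [None] * num_blocks   # block number currently held by each frame
    seen = set()                   # blocks referenced at least once so far
    misses = conflicts = 0
    for addr in addresses:
        block = addr // block_size
        frame = block % num_blocks
        if frames[frame] != block:
            misses += 1
            if block in seen:
                conflicts += 1     # a recurrence turned into a miss
            frames[frame] = block
        seen.add(block)
    return misses, conflicts


# Assumed address sequence for references a_1 .. a_8 of Figure 1.
TRACE = [0, 1, 2, 3, 1, 2, 1, 2]
misses, c_d = direct_mapped_misses(TRACE, num_blocks=2)
n = len(TRACE)
r = n - len(set(TRACE))            # recurrences: references that are not first-time
rho = (n - (r - c_d)) / n          # Equation 2
```

With two one-byte blocks, the only dimensional conflict is the one at $a_5$, so the miss ratio is the intrinsic 4/8 plus 1/8.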


In a simple design, when a cache miss occurs, instruction fetch stalls and the instruction cache
waits for the appropriate cache block to be filled. After instruction cache repair is completed,
the instruction fetch resumes. The number of stalled cycles is determined by three parameters:
the initial cache repair latency ($L$), the block size, and the cache-memory bandwidth ($\beta$). For a
single cache miss, the number of stalled cycles is the initial cache repair latency plus the number
of transfers required to repair the cache block. The total miss penalty without load forwarding, $t_n$,
is expressed by the number of total misses multiplied by the number of stalled cycles for a single
cache miss.

$$t_n = (N - (R - C_D)) \times \left(L + \frac{2^B}{\beta}\right) \qquad (3)$$

This is the miss-penalty model used when load forwarding is not assumed. The miss penalty ratio
is calculated by dividing the miss penalty, $t_n$, by $N$.
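The miss-penalty model without load forwarding reduces to a one-line computation. The sketch below is illustrative only: the function and parameter names are hypothetical, the bandwidth is taken to be in bytes per cycle, and the latency, block size, and bandwidth values in the example are assumptions chosen for easy arithmetic, not measurements from this paper.

```python
def miss_penalty_no_forwarding(n_refs, recurrences, conflicts,
                               latency, log2_block, bandwidth):
    """Total stall cycles: misses times (latency + transfers per block)."""
    misses = n_refs - (recurrences - conflicts)
    stall_per_miss = latency + (2 ** log2_block) / bandwidth
    return misses * stall_per_miss


# Example: 8 references, 4 recurrences, 1 dimensional conflict,
# an assumed 4-cycle latency, 16-byte blocks, and 4 bytes per cycle.
t_n = miss_penalty_no_forwarding(8, 4, 1, latency=4, log2_block=4, bandwidth=4)
penalty_ratio = t_n / 8            # stall cycles per reference
```

Each of the 5 misses stalls for 4 + 16/4 = 8 cycles in this example.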

2.2 Load Forwarding 

Load forwarding was evaluated by Hill and Smith [12]. They concluded that load forwarding in
combination with prefetching and sub-blocking increases the performance of the cache. In this
paper, we use a simpler version of the load forwarding scheme where neither prefetching nor
sub-blocking is performed. The state transition diagram for load forwarding is shown in Figure 3.
The instruction cache is in the standby state initially (state 0). When a cache miss occurs, the
instruction fetch stalls (state 1). Instead of waiting for the entire cache block to be filled before
resuming, the cache loads the block from the currently-referenced instruction and forwards the
instruction to the instruction fetch unit (state 2). Furthermore, if the instruction reference stream
is sequential, each subsequent instruction is forwarded to the instruction fetch unit until the end
of the block is reached or a taken branch is encountered. Any remaining unfilled cache-block bytes
are repaired in the normal manner, and the instruction fetch stalls (state 3). This load forwarding
scheme requires no sub-block valid bits and therefore has simpler logic for cache block repair than
sub-block-based schemes.

An example of the cache-block repair process with load forwarding is provided in Figure 4.
Reference X results in a miss. It takes $L$ cycles before this reference is placed in the appropriate
block location and is forwarded to the fetch unit. Reference Y is a sequential access, thus it is
considered as a hit. It is placed in the cache and forwarded to the fetch unit. Reference Z breaks
the sequential-reference stream, load forwarding stops, and cache repair of the block continues.
When the end of the block is reached, the cache repair continues from the beginning of
the cache block. At cycle $L+3$, the entire cache block is filled, and the fetch unit continues with the
next instruction reference. The block wrap-around time is assumed to be negligible compared to
the total block-repair time (footnote 1). References X and Y are sequential and constitute a run length (the
number of sequential instructions before a taken branch) of 2.

Figure 3: State transition diagram of the load forwarding process (legend: instruction fetch unit not stalled; instruction fetch unit stalled).

Figure 4: An example of the load forwarding process.

Footnote 1: For the actual hardware implementation, the cache repair can start at the beginning of the cache block. When
the location of the instruction to be fetched is encountered within the cache block, load forwarding begins. Load
forwarding terminates when the end of the block is reached or when a taken branch is encountered. Cache repair
then continues to the end of the block. The miss penalty incurred by this method is the same as the one presented in the
paper.

For the $i$-th cache miss, if the total number of bytes where the instruction fetch and cache repair
overlap is represented by $S[i]$, the total miss penalty with load forwarding, $t_l$, is expressed as

$$t_l = t_n - t_s \qquad (4)$$

where $t_s$ is

$$t_s = \sum_{i=1}^{(N-R)+C_D} \frac{S[i]}{\beta} \qquad (5)$$

$t_s$ measures the number of cycles saved by load forwarding. Equation 4 is the miss-penalty model
used when load forwarding is assumed. The miss penalty ratio with load forwarding is calculated
by dividing the miss penalty, $t_l$, by $N$.


The saved cycles expressed in Equation 5 are constrained by two factors. First, load forwarding is
limited by the sequentiality of the instruction reference stream. The more sequential the instruction
reference stream is, the more overlap between the cache repair and load forwarding cycles can
be achieved. Second, assuming the sequentiality of the referencing stream is not a problem, load
forwarding is performed only from the missed reference until the end of the block. Thus the savings
are highly dependent upon the location of the miss within the cache block. The sequentiality of the
reference stream can be increased by appropriate compiler optimizations, and this will be discussed
in Section 3. The second factor is highly variable and dependent upon the instruction reference
stream and the block size.
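The two constraints on the saved cycles can be made concrete with a small sketch. It is one possible reading of the model rather than the authors' simulator: the function names are hypothetical, and the overlap for a miss is assumed to be the sequential run starting at the missed instruction (counted in bytes and including that instruction), capped by the distance to the end of the block.

```python
def overlap_bytes(miss_offset, run_bytes, block_size):
    """Bytes forwarded during repair: limited by the run and by the block end."""
    return min(run_bytes, block_size - miss_offset)


def miss_penalty_with_forwarding(t_n, overlaps, bandwidth):
    """Equations 4 and 5: subtract the cycles saved on each miss from t_n."""
    t_s = sum(s / bandwidth for s in overlaps)
    return t_n - t_s


# Assumed example: 16-byte blocks, 4-byte instructions, 4 bytes per cycle.
# Miss 1 is at offset 4 with a run of 2 instructions; miss 2 is at offset 12
# with the same run, so only the last 4 bytes of the block can be forwarded.
overlaps = [overlap_bytes(4, 8, 16), overlap_bytes(12, 8, 16)]
t_l = miss_penalty_with_forwarding(40.0, overlaps, bandwidth=4)
```

The second miss shows the position effect: the same run length saves fewer cycles when the miss falls near the end of the block.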


3 Optimizations and Code Transformations 


3.1 Base Optimizations 


A standard set of classic optimizations is available in commercial compilers today (see Table 1).
The goal of these optimizations is to reduce the execution time. Local optimizations are performed
within basic blocks, whereas global optimizations are performed across operations in different basic
blocks. In this paper, these classic code optimizations are always performed on the compiled
programs.

Local                               Global
constant propagation                constant propagation
copy propagation                    copy propagation
common subexpression elimination    common subexpression elimination
redundant load elimination          redundant load elimination
redundant store elimination         redundant store elimination
constant folding                    dead code removal
strength reduction                  loop invariant code removal
constant combining                  loop induction strength reduction
operation folding                   loop induction elimination
operation cancellation              global variable migration
dead code removal                   loop unrolling
code reordering

Table 1: Base optimizations.


3.2 Execution Profiler 

Execution profiling is performed on all measured benchmarks. The IMPACT-I profiler translates
each target C program into an equivalent C program with additional probes. When the equivalent
C program is executed, these probes record the basic block weights and the branch characteristics
for each basic block. Profile information is used to guide the code expanding optimizations. The
profile information is collected using an average of 20 program inputs per benchmark. An additional
input is then used to measure the cache performance.
3.3 Instruction Placement


Reordering program structure to improve the memory system performance is not a new subject.
In more recent literature regarding instruction caches, instruction placement has been shown to
improve performance [19] [20] [21]. The IMPACT-I C compiler instruction placement algorithm
improves the efficiency of caching in the instruction memory hierarchy [19]. Based on dynamic
profiling, this algorithm increases the sequential and spatial localities, and decreases cache mapping
conflicts of the instruction accesses.

For a given function body, several steps are taken to reorder the instruction sequence. For
each function, basic blocks which tend to execute in sequence are grouped into traces [22] [23].
Traces are the basic units used for instruction placement. The algorithm starts with the function
entrance trace and expands the placement by placing the most important descendant after it. The
placement continues until all the traces with a non-zero execution profile count have been placed.
Traces with a zero execution count are moved to the bottom of the function, resulting in a smaller
effective function body.

Reordering the basic blocks does not increase the program size significantly. The overall
sequentiality of the resulting code is increased (i.e., the number of taken branches is reduced) due
to the formation of traces, and this may increase the need for a larger cache block size. For the
same cache size, an increase in block size translates to a decrease in tag store. The overall locality
of the resulting code is increased due to the placement of more important traces at the beginning
of the function.
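The placement order described above can be sketched as a greedy walk over profiled traces. This is a simplified illustration under stated assumptions, not the IMPACT-I algorithm of [19]: the function name, the `counts` and `edges` dictionaries, and the rule of restarting from the hottest unplaced trace when no profiled successor remains are all hypothetical.

```python
def place_traces(entry, counts, edges):
    """Order traces: follow the heaviest edge to an unplaced trace, cold traces last."""
    hot = [t for t in counts if counts[t] > 0 and t != entry]
    cold = [t for t in counts if counts[t] == 0 and t != entry]
    order, placed, current = [entry], {entry}, entry
    while len(placed) - 1 < len(hot):
        # Unplaced, executed successors of the trace placed most recently.
        succ = [(w, dst) for (src, dst), w in edges.items()
                if src == current and dst not in placed and counts[dst] > 0]
        if succ:
            nxt = max(succ)[1]
        else:
            # No successor left: continue from the hottest unplaced trace.
            nxt = max((t for t in hot if t not in placed), key=lambda t: counts[t])
        order.append(nxt)
        placed.add(nxt)
        current = nxt
    return order + cold            # zero-count traces go to the bottom


# Hypothetical profile: execution counts per trace and weights per control edge.
counts = {"entry": 100, "A": 90, "B": 10, "C": 0, "D": 60}
edges = {("entry", "A"): 90, ("entry", "B"): 10, ("A", "D"): 60, ("B", "D"): 10}
layout = place_traces("entry", counts, edges)
```

In this example the never-executed trace C ends up last, which is what shrinks the effective function body.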
3.4 Function Inline Expansion


Function inline expansion replaces the frequently invoked function calls with the function body. The
importance of inline expansion as an essential part of an optimizing compiler has been described
by Allen and Johnson [24]. Several optimizing compilers perform inline expansion. For example,
the IBM PL.8 compiler does inline expansion of all leaf-level procedures [25]. In the GNU C
compiler, the programmer can use the keyword inline as a hint to the compiler for inline expanding
function calls [2]. The Stanford MIPS C compiler examines the code structure (e.g., loops) to
choose the function calls for inline expansion [26]. The IMPACT-I C compiler has an algorithm
that automatically performs inter-file inlining assisted by the profile information, where only the
important function call sites are considered [4]. Inlining is done primarily to enlarge the scope of
optimization and scheduling.

Since the callee is expanded into the caller, inline expansion increases the spatial locality and
decreases the number of function calls. This transformation increases the number of unique
references, which may result in more misses. However, a decrease in the miss ratio may also occur,
because without inline expansion the callee has the potential to replace the caller in the instruction
cache. With inline expansion, this effect is reduced. Inline expansion provides large functions to
enlarge the size of traces selected. This enlargement of function bodies helps to further the
effectiveness of instruction placement. With an increase in the sequentiality of the referencing stream,
an improvement in the performance of load forwarding can be expected.

3.5 Optimizations for Superscalar Processors 

Since basic blocks typically contain few instructions, there is little parallelism within a basic block.
For superscalar processors, many code transformations are necessary in order to increase the number
of instructions available for scheduling. Many researchers have shown the effectiveness of
these optimizations [27] [28] [29]. Although these optimizations are frequently used for
superscalar processors, they are also useful for scalar processors (e.g., the MIPS C compiler
performs automatic loop unrolling [3]). The following superscalar optimizations have been
implemented in the IMPACT-I C compiler and are performed in addition to function inline expansion
and instruction placement. They have been shown to provide significant speedup on superscalar
processors [30].

Super-block formation: A super-block is a sequence of instructions that can be reached only
from the top instruction and may contain multiple branch instructions. A trace can be converted to
a super-block by creating a copy of the trace and by redirecting all control transfers to the middle
of the trace to the duplicate copy; thus, super-block formation, or trace duplication, increases code
optimization and scheduling freedom.

Loop unrolling: The body of a loop is duplicated to increase the number of instructions in
the super-block. To unroll the loop $N$ times, the body of the loop is duplicated $(N - 1)$ times. For
multiple instruction issue processors, the IMPACT-I C compiler typically unrolls small loops four
or more times. For larger loops, $N$ decreases according to the loop size.

Loop peeling: Many loops iterate very few times (e.g., fewer than ten). For these loops, loop
unrolling and software pipelining are less effective because the execution time spent in the parallel
section (the optimized loop body) is not substantially longer than in the sequential section (the loop
prologue and epilogue). An alternative approach to loop unrolling is to peel off enough iterations
such that the loop typically executes as straight-line code.

Branch target expansion: Instruction placement and super-block formation introduce many
branch instructions. Branch target expansion helps to reduce the number of taken branches by
copying the target basic block of a frequently taken branch into its fall-through path. The number
of static instructions increases due to this optimization.

Super-block formation, loop unrolling, loop peeling, and branch target expansion increase the
sequentiality of the code. Loop unrolling and loop peeling decrease both spatial and temporal
locality. A reduction in cache performance can be expected due to a decrease in spatial locality.
The increased code size and increased unique references can be expected to increase the cache size
requirement.

program    description                 object code size (bytes)   instruction references
cccp       GNU C preprocessor          20400                      2.89 x 10^7
eqntott    truth table generator       15256                      1.47 x 10^8
espresso   boolean minimization        61264                      5.48 x 10^7
mpla       pla layout                  138808                     1.07 x 10^8
tbl        format table for troff      24804                      3.08 x 10^7
xlisp      lisp interpreter            31920                      1.46 x 10^8
yacc       parsing program generator   21320                      3.47 x 10^7

Table 2: Benchmark program characteristics.


4 Experiments and Analysis 

4.1 Benchmark Programs 

Table 2 shows the benchmark programs that are used in this paper. Three of the programs,
eqntott, espresso, and xlisp, are from the SPEC (footnote 2) benchmark set [31]. Four other C programs,
mpla, cccp, yacc, and tbl, are commonly used scalar programs. The object code size column gives
the program size in bytes without any code expanding optimizations. The sizes of these benchmark
programs are large enough for studying instruction caches. The instruction references column gives
the corresponding number of dynamic instruction references. These instruction references are for
the full run of each benchmark program; no sampling or reference partitioning is used.

Footnote 2: The University of Illinois is a member of SPEC.

4.2 Measurement Tools 

The measurement results are generated by trace driven simulation. To collect the instruction
traces, the compiler's code generator was modified to insert probes into the assembly language
program. Executing the modified program with sample input data produced the instruction trace.
The traces consist of IMPACT assembly instructions (LCODE; footnote 3), which are similar to the MIPS
R2000 assembly language [32].

Since the performance numbers for many cache dimensions are needed, a one-pass cache simulator
is used. The cache simulator for the experiments uses the recurrence/conflict model [17], where
only one pass over the instruction trace is needed to simulate all cache dimensions. Similarly,
the information required to derive the miss penalty with load forwarding is collected for all cache
dimensions. In this paper, associativities of one-way, two-way, four-way, and fully associative are
simulated. The block sizes considered are 16, 32, 64, and 128 bytes. The cache sizes range from
1K to 128K bytes.
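For readers who want to reproduce a single point of this design space, a conventional per-configuration simulator is enough. The sketch below is explicitly not the one-pass recurrence/conflict simulator of [17]; it is a plain set-associative LRU model with a hypothetical function name, shown on a toy trace rather than the benchmark traces.

```python
from collections import OrderedDict


def miss_ratio(addresses, cache_size, block_size, assoc):
    """Miss ratio of one set-associative LRU cache (sizes in bytes)."""
    num_sets = cache_size // (block_size * assoc)
    sets = [OrderedDict() for _ in range(num_sets)]
    misses = 0
    for addr in addresses:
        block = addr // block_size
        lru = sets[block % num_sets]
        if block in lru:
            lru.move_to_end(block)         # mark as most recently used
        else:
            misses += 1
            if len(lru) == assoc:
                lru.popitem(last=False)    # evict the least recently used block
            lru[block] = True
    return misses / len(addresses)


# Toy trace with one-byte blocks; assoc == cache_size // block_size
# gives a fully associative cache (a single set).
TRACE = [0, 1, 2, 3, 1, 2, 1, 2]
direct = miss_ratio(TRACE, cache_size=2, block_size=1, assoc=1)
full = miss_ratio(TRACE, cache_size=2, block_size=1, assoc=2)
```

On this toy trace the two-block direct mapped cache misses less often than the fully associative LRU cache of the same size, which is the kind of behavior the loop model of [8] describes.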

4.3 Empirical Data and Analysis 

For the purpose of experimentation, the code expanding optimizations described in Section 3 are
organized into four optimization levels with increasing functionality: no (no code expanding
optimization), pl (instruction placement), in (function inline expansion plus instruction placement),
and su (superscalar optimization, function inline expansion, and instruction placement).
Experiments are conducted by varying the optimization level to measure the incremental and accumulative
effects of these optimizations.

Footnote 3: LCODE documentation is available as an internal report.

Table 3: Accumulated code size increase.


General Effects 

In order to quantify the effect of optimization on code size, the object code size was measured for
each level of optimization. Table 3 shows the relative object code size for each optimization level. All
ratios and percentages are computed based on the code size without code expanding optimization.
Instruction placement increases the average code size by 2%. Function inline expansion results in a
15% code expansion after instruction placement, as indicated by the 17% increase in average code
size in the in column of Table 3. Superscalar optimization further increases the code size by 38%
after both inline expansion and instruction placement. The total code expansion due to all
three optimizations is 55%, which reinforces the concern that these optimizations may degrade the
instruction cache performance.
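The relation between the incremental and accumulated figures can be checked with simple arithmetic. The sketch below uses only the 2%, 17%, and 38% figures quoted above and assumes, as the text states, that every percentage is relative to the unexpanded code size, so increments add rather than compound; the function and constant names are hypothetical.

```python
# Figures quoted in the text, all relative to the size of the unexpanded code.
PL_TOTAL = 2      # % increase after instruction placement
IN_TOTAL = 17     # % increase after placement plus inline expansion
SU_EXTRA = 38     # further % added by superscalar optimizations


def code_size_breakdown(pl_total, in_total, su_extra):
    """Return (inline increment, total expansion) in percent of the base size."""
    inline_increment = in_total - pl_total   # share added by inlining alone
    total = in_total + su_extra              # all three optimizations together
    return inline_increment, total


inline_increment, total_expansion = code_size_breakdown(PL_TOTAL, IN_TOTAL, SU_EXTRA)
```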

The instruction working set of a program is defined as the smallest fully-associative instruction
cache which achieves a 0.1% miss ratio for the program. It provides a relative measure of the cache
size requirement of programs. Table 4 presents the instruction working set size of each benchmark
for all optimization levels. All numbers presented are in $\log_2$ scale (e.g., 14 is a 16K byte cache).
The largest working set size needs at most a 32K byte cache. All miss ratios for the larger caches
are considered negligible, and for this reason, cache sizes larger than 32K will generally not be
shown in this paper. Instruction placement and function inline expansion have very little effect on
the instruction working set size. Superscalar optimization approximately doubles the instruction
working set size. This is expected since superscalar optimizations result in the largest increase in
code size.

Table 4: Working set size for various block sizes in $\log_2$ cache size.

             no            pl            in            su
program    num  % inc    num  % inc    num  % inc    num   % inc
cccp       5.1    -      7.5    47     7.7    50     10.5   105
eqntott    3.8    -      5.9    53     5.9    54      5.9    54
espresso   6.4    -      8.4    31     9.1    42     14.8   131
mpla       5.1    -      8.9    76     9.9    96     17.8   253
tbl        3.5    -      4.9    42     6.4    84     13.1   278
xlisp      4.2    -      6.3    50     9.5   129     10.8   159
yacc       4.0    -      5.9    47     6.1    51     13.0   223
average    4.6    -      6.8    48     7.8    70     12.3   167

Table 5: Average number of sequential instructions.
                             % change
program    no (base)       pl       in       su
cccp       2.89 x 10^7    -0.27    -2.01    -3.17
eqntott    1.47 x 10^8    -0.42    -0.43    -0.45
espresso   5.48 x 10^7    +0.18    -1.23    -3.33
mpla       1.07 x 10^8    -0.62    -6.18    -10.1
tbl        3.08 x 10^7    +0.21    -12.3    -16.2
xlisp      1.46 x 10^8    -1.84    -14.6    -16.7
yacc       3.47 x 10^7    -1.00    +0.13    +6.53

Table 6: Number of dynamic references.

As discussed in Section 3, all three code expanding optimizations can improve the sequentiality
of instruction access. To quantify this effect, the average number of sequential instructions
executed between taken branches was measured. As shown in Table 5, all three optimizations
improve the sequentiality significantly. With all optimizations, the average number of sequential
instructions increased from 4.6 to 12.3. This dramatic increase in sequentiality suggests that schemes
such as load forwarding may be able to offset the negative effect of code expansion. We will further
explore this subject later in this section.
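
The sequentiality metric described above can be illustrated with a short sketch. It assumes an instruction address trace is available as a list of byte addresses, that every instruction is 4 bytes long, and that any non-sequential successor address marks a taken branch. The function name `average_run_length` is hypothetical and is not taken from the measurement tools used in this paper.

```python
def average_run_length(addresses, instr_size=4):
    """Average number of instructions executed between taken branches.

    Assumption: a taken branch is any point in the trace where the next
    address is not the current address plus one instruction.
    """
    if not addresses:
        return 0.0
    runs = 1  # the first instruction opens the first sequential run
    for prev, cur in zip(addresses, addresses[1:]):
        if cur != prev + instr_size:
            runs += 1  # control transfer: a new sequential run starts
    return len(addresses) / runs


# Three runs (3, 2, and 1 instructions long) over six instructions.
trace = [0, 4, 8, 100, 104, 200]
print(average_run_length(trace))
```

A longer average run means that more of each fetched cache block is consumed before control leaves it, which is the property load forwarding relies on.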

Although the static code size increases significantly after the code expanding optimizations, the
number of dynamic instruction references tends to decrease with each additional level of optimization.
Table 6 presents the number of instruction references for each benchmark program. The
largest improvement results from function inline expansion: this is due to the increased opportunity
to apply classic local and global optimizations on the inlined version of the code and to eliminate
instructions that save and restore registers across function boundaries. The purpose of superscalar
optimizations is to uncover parallelism and scheduling opportunities. Note, however, that
superscalar optimizations often result in a decrease in the number of instruction references. The
contribution of instruction placement to the number of dynamic references is small when compared
to the other optimizations since instruction placement only performs code reordering.

            [block size header missing]      [block size header missing]
[earlier rows missing; last surviving value: 200]
espresso    2170   ...    2320   ...         1140   1130   1210   1740
mpla        3500   3300   4200   5620        1900   1700   2200   2970
tbl         1310   1270   1510   2000         690    660    780   1070
xlisp        800    700    800   1100         400    400    500    600
yacc         980    910   1040   2020         530    480    550   1060

            64 byte block                    128 byte block
cccp         240    230    260    310         140    ...    140    170
eqntott      100    200    100    100          90    100    100     90
espresso     600    600    640    940         320    330    350    520
mpla        1000    900   1200   1600         600    500    700    870
tbl          360    350    420    570         180    180    220    300
xlisp        300    300    300    300         200    200    200    200
yacc         290    250    300    570         160    130    160    310

Table 7: Number of unique references.


The sum of the number of recurrent references and the number of unique references constitutes
the number of total dynamic references. Table 7 shows that the number of unique references
increases for inlining and superscalar optimizations, but decreases for instruction placement. The
absolute difference in unique references does not cause a significant variation in the
miss ratio, since the difference is insignificant when compared to the number of dynamic references
in Table 6.
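
To make the scale of this comparison concrete, the following sketch divides the number of unique references by the number of dynamic references, using the unoptimized entries for 64-byte blocks as they appear in Tables 6 and 7. Only three benchmarks with clearly legible entries are included; the dictionary and function names are hypothetical.

```python
# Dynamic instruction references without optimization (Table 6, base column).
DYNAMIC_REFS = {"cccp": 2.89e7, "espresso": 5.48e7, "tbl": 3.08e7}

# Unique references without optimization, 64-byte blocks (Table 7).
UNIQUE_REFS_64B = {"cccp": 240, "espresso": 600, "tbl": 360}


def intrinsic_miss_ratio(unique_refs, dynamic_refs):
    """Fraction of dynamic references that are first-time (unique) references."""
    return unique_refs / dynamic_refs


for program in DYNAMIC_REFS:
    ratio = intrinsic_miss_ratio(UNIQUE_REFS_64B[program], DYNAMIC_REFS[program])
    print(f"{program}: {ratio:.2e}")
```

For these entries the ratio is on the order of 1 in 100,000 references, so a change of a few hundred unique references has little influence on the overall miss ratio.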


Instruction Placement 




Figure 5: Average effect of placement.

Figure 6: The effect of placement for the highest miss ratios.

Figure 7: Effect of placement on dimensional conflicts and unique references. (Legend: dimensional miss ratio without placement; dimensional miss ratio with placement.)

Figure 5 shows the effect of instruction placement on the average cache miss ratio^4. On one hand,
instruction placement reduces the miss ratio for small caches (1K and 2K). For example, the miss ratio
of a 1K cache with placement is comparable to that of a 2K cache without placement. On the
other hand, instruction placement has very little effect on large caches (8K and 16K). The same
trend can be observed from the worst case miss ratios in Figure 6. The worst case miss ratio is the
maximal miss ratio observed among all benchmark programs. Note that the benefit of instruction
placement is more pronounced for programs with high miss ratios. This is a very desirable effect
since it increases the stability of the cache performance.

To analyze why instruction placement improves the performance of small caches, we have measured
the misses due to unique references (intrinsic misses, see Section 2) and those due to dimensional
conflicts (dimensional misses). The log plot of Figure 7 shows the contribution of each to
the miss ratio with and without placement. The black bars show the intrinsic miss ratio. Figure 7
clearly indicates that instruction placement makes a negligible difference in the number of intrinsic
misses^5. The shaded bars in Figure 7 show the dimensional misses. As can be seen in the figure,
the reduced miss ratio after placement is due to decreased dimensional conflicts^6.

^4 We found that the effect of instruction placement on the cache miss ratio of other associativities closely follows
that of the direct mapped cache case; therefore only the direct mapped cache results are presented.

The changes in program behavior due to instruction placement explain the discrepancy between
small and large caches. The working sets of the benchmark programs do not fit into small caches.
This accounts for the high miss ratio of the small caches. Instruction placement separates the
frequently executed code segments from those executed infrequently. This helps the small caches
to accommodate the frequently executed portions of the programs. Therefore, the performance of
small caches improves significantly after instruction placement. Since large caches can accommodate
the working set of most benchmark programs, the compaction effect of instruction placement does
not make a significant difference for these cache sizes.
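
The split between intrinsic and dimensional misses can be illustrated with a minimal direct mapped cache model. This is a sketch under stated assumptions, not the simulator used for the measurements: an intrinsic miss is taken to be the first reference to a block, every other miss is counted as dimensional, and the function name `simulate_direct_mapped` is hypothetical.

```python
def simulate_direct_mapped(addresses, cache_bytes, block_bytes):
    """Return (intrinsic, dimensional) miss ratios for a byte-address trace."""
    num_sets = cache_bytes // block_bytes
    resident = [None] * num_sets   # block currently held by each cache line
    seen = set()                   # blocks referenced at least once so far
    intrinsic = dimensional = 0
    for addr in addresses:
        block = addr // block_bytes
        index = block % num_sets
        if resident[index] == block:
            continue               # hit
        if block in seen:
            dimensional += 1       # block was evicted by a conflicting block
        else:
            intrinsic += 1         # first reference to this block
            seen.add(block)
        resident[index] = block
    total = len(addresses)
    if total == 0:
        return 0.0, 0.0
    return intrinsic / total, dimensional / total


# Two hot blocks that collide in a 64-byte cache with 16-byte blocks ...
print(simulate_direct_mapped([0, 64, 0, 64], cache_bytes=64, block_bytes=16))
# ... and the same loop after the blocks are placed next to each other.
print(simulate_direct_mapped([0, 16, 0, 16], cache_bytes=64, block_bytes=16))
```

In the second trace the intrinsic misses are unchanged while the dimensional misses vanish, which mirrors the behavior reported for instruction placement on small caches.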

Function Inline Expansion 

Function inline expansion has two conflicting effects on cache performance. On the positive side,
with inlining the caller and callee bodies are processed together by instruction placement. This
allows instruction placement to significantly increase the sequentiality of the program (see Table 5).
When the cache miss ratio is high, the increased sequentiality reduces the miss ratio because it
increases the number of useful bytes transferred for each cache miss. On the negative side, inlining
increases the working set size (see Tables 3 and 4). If the working set fits into a cache before inlining
but does not after inlining, the cache miss ratio may increase substantially.

^5 The reader is encouraged to derive the intrinsic miss ratio by dividing the number of unique references in Table 7
by the number of dynamic references in Table 6.

^6 Note that Figure 7 is in log scale, which is necessary to make the intrinsic miss ratio visible. However, the log
scale also magnifies the miss ratio of large caches. For example, instruction placement seems to make comparable
differences for small caches (1K and 2K) and large caches (16K and 32K) in Figure 7. However, it is clear from
Figure 5 that instruction placement has a strong effect on small caches but a negligible effect on large caches.

Figure 10: Effect of superscalar optimizations for direct mapped cache.


Figures 8 and 9 show the effect of inline function expansion on cache performance^7. The cache
miss ratio is relatively high for small caches before inlining. In this range, the increased sequentiality
reduces the cache miss ratio. In the middle range (8K, 16K, and 32K), the working sets of some
benchmarks fit in the cache before inlining but not after inlining. As a result, inlining increases the
cache miss ratio. The 64K cache is large enough to accommodate the program working set before
and after inlining. Therefore, inlining has a negligible effect in caches of size 64K and greater.

Superscalar Optimizations 

Figure 10 shows the changes in the cache miss ratios when superscalar optimizations are applied
after inlining and placement. The miss ratios are consistently higher with superscalar optimizations.
Therefore, a larger cache is required to compensate for the effect of superscalar optimizations and
maintain the same miss ratio. This is consistent with the working set size calculated in
Table 4. If the block size is kept constant, the cache size required to maintain the same
miss ratio is approximately twice that required for code with no superscalar optimizations.

^7 As before, the trend for higher set associativities is very close to the results for the direct mapped cache. Thus, only
the direct mapped results are presented.

Figure 11: Effect of superscalar optimizations on dimensional conflicts and unique references. (Direct mapped cache. Legend: dimensional miss ratio with superscalar optimizations, inlining, and placement; intrinsic miss ratio.)


Figure 11 indicates that superscalar optimizations increase the number of unique references,
but the increase is not significant. Therefore, it is the increase in code size rather than the increase
in unique references that is the primary cause of reduced cache performance.


All Optimizations 

Figure 12 shows the cumulative effect of all optimizations on direct mapped caches. Intuitively,
smaller caches should perform worse on expanded code because of an increase in the expected number
of dimensional conflicts. However, the experimental data show the opposite. For the 1K and 2K
caches, the miss ratios of code without code expanding optimizations are larger than the miss ratios
of code with code expanding optimizations. Sequentiality is increased by superscalar optimizations.
Thus, for larger block sizes, the decrease in miss ratio is due to sequentiality (e.g., for the 1K cache in
Figure 12, code with superscalar optimizations has a larger drop in miss ratio going from 64B to
128B block size than code with no optimization). For small block sizes, the positive effect of higher
sequentiality disappears, and the negative effect of code expansion causes an increase in the miss
ratio. However, the increase in code locality by function inlining and instruction placement is still
large enough to offset the negative effect of the code expansion, and a slight decrease in the miss
ratio can still be seen in small caches.


Load Forwarding 

The results of load forwarding are presented in Figure 13. Since superscalar optimizations have
the worst results thus far, they are used here to evaluate the effectiveness of load forwarding. The
initial memory repair latency (L) is assumed to be 4 cycles, and the cache-memory bandwidth (β)
is assumed to be 4 bytes. Equations 3 and 4 are used to calculate the relative miss time penalty.
Load forwarding reduces the miss penalty and effectively upgrades the cache to a performance
level similar to that of a non-load-forwarding cache of twice the size. For example, assume that a 2K direct
mapped cache with a block size of 64 bytes is used with load forwarding. Using the same block size,
the miss penalty is approximately the same as that of a 4K cache without load forwarding. When
superscalar optimizations are used, the designer can either double the cache size to maintain the
same performance level or use load forwarding and achieve the same result.

Figure 13: Effect of load forwarding for direct mapped cache.

Figure 14: Reference stream and cache block refills.

Another observation is that a block size of 128 bytes has consistently higher average miss
penalties than the other block sizes. This can be explained by the number of sequential instructions
shown in Table 5. The overall average run length for superscalar optimizations is approximately
12.3 instructions (49.2 bytes). It is possible that the first non-sequential miss will not be at the
beginning of the block (see Figure 14). Using the symbol R for the run length and l for the
starting location of the run within the cache block, the total number of cache blocks involved in a miss
is formulated as

$$N(l, B, R) = \left\lceil \frac{\beta\,(l + R)}{2^{B}} \right\rceil \qquad (6)$$


The ceiling function is used to include all used cache blocks. For each run length, there are $2^{B}/\beta$
starting locations. Assuming a uniform distribution over all starting locations, the probability of each
starting location is $\beta/2^{B}$. Therefore, the penalty of each cache miss for a particular run
length is given by Equation 7.

$$P(R, B) = \sum_{l=0}^{2^{B}/\beta - 1} \frac{\beta}{2^{B}} \times \left( N(l, B, R) \times \left( L + \frac{2^{B}}{\beta} \right) - R \right) \qquad (7)$$

For simplicity, an integer approximation of the run length is used. Instead of 12.3, the value of 13
is used for R in Equations 6 and 7.


P(13, 4) = 19 cycles      (8)

P(13, 5) = 17 cycles      (9)

P(13, 6) = 22 cycles      (10)

P(13, 7) = 36.5 cycles    (11)


The calculated values follow the trend in Figure 13 closely. For B equal to 4, 5, and 6, the load
forwarding miss penalties are roughly the same, with B equal to 5 the lowest and B equal to
4 the next lowest. For B equal to 7, the load forwarding miss penalty is noticeably higher than for
the other block sizes, and this can also be shown by using Equation 7.
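
The calculated penalties can be checked numerically. The sketch below is one reading of Equations 6 and 7, assuming a bandwidth β of 4 bytes per cycle (one instruction), run lengths and starting locations counted in instructions, and a cost of L plus 2^B/β cycles for each block touched, less the R cycles in which useful instructions are delivered. The function names are hypothetical.

```python
def blocks_touched(l, B, R, beta=4):
    """Equation 6: blocks covered by a run of R instructions starting at slot l."""
    block_bytes = 2 ** B
    return (beta * (l + R) + block_bytes - 1) // block_bytes  # integer ceiling


def miss_penalty(R, B, L=4, beta=4):
    """Equation 7: expected penalty in cycles, averaged over starting locations."""
    slots = 2 ** B // beta          # possible starting locations per block
    refill_cycles = 2 ** B // beta  # cycles needed to transfer one block
    total = 0
    for l in range(slots):
        total += blocks_touched(l, B, R, beta) * (L + refill_cycles) - R
    return total / slots


for L in (4, 10):
    print(L, [miss_penalty(13, B, L=L) for B in (4, 5, 6, 7)])
```

With L set to 4 this reading yields 19, 17, 22, and 36.5 cycles for B equal to 4 through 7, and with L set to 10 it yields 43, 32, 32.5, and 44.75 cycles.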

Figure 15: Effect of initial load delay (4K cache).

The miss penalty for each run of sequential accesses is dominated by three values: the initial
load delay, the number of refill cycles with load forwarding, and the number of refill cycles without
load forwarding. While the initial load delay is dependent upon the hardware design technology, the
non-stalling and stalling refill cycles are related to the block size and the instruction sequentiality.
Before the initial load delay reaches a certain threshold value, the number of refill cycles will have a
dominant effect upon the miss penalty. Larger block sizes will tend to have a higher number of wasted
refill cycles than smaller block sizes. However, larger block sizes are penalized less for the initial
load delay than smaller block sizes. Figure 15 shows the effect of varying the value of the initial
load delay on block sizes for a 4K cache. For each value of L, the miss penalty ratio is compared
between four block sizes. For small values of L, 16- and 32-byte blocks perform the best. But for
larger values of L, the 64-byte block performs the best. This is also verified by Equation 7. Here, the
value of L is set to 10.

P(13, 4) = 43 cycles       (12)

P(13, 5) = 32 cycles       (13)

P(13, 6) = 32.5 cycles     (14)

P(13, 7) = 44.75 cycles    (15)

From Figure 15, for an initial delay of 10, block sizes of 32 and 64 bytes have similar performance,
and block sizes of 16 and 128 bytes have similar performance.

As the value of L increases, the performance of the larger block sizes increases while the performance
of the smaller block sizes decreases. It is not until the initial load delay reaches 40 cycles that
128-byte blocks start to outperform other block sizes. For smaller cache sizes, the miss ratios are
the dominating factor, and a smaller block size should be used. In contrast, for larger cache
sizes, since the miss ratios are very small, larger block sizes are preferred.


5 Conclusions 


This paper analyzes the effect of compile-time code expanding optimizations on instruction cache
design. We first show that instruction placement, function inline expansion, and superscalar
optimizations cause substantial code expansion, reinforcing the concern that they may increase the
cache size required to achieve a given performance level. We then show the actual effect of each
optimization on cache design.

Among the three types of optimizations, instruction placement causes the least amount of code
expansion. Its effects on the cache performance are mostly due to the increased instruction access
sequentiality. For small caches where the miss ratio is relatively high, the increased sequentiality
reduces the number of cache misses by increasing the useful bytes transferred for each cache
miss. For large caches where the miss ratio is relatively low, the effect of instruction placement is
negligible.


Inline function expansion affects the cache performance by increasing both the sequentiality
and the working set size. For small caches where the miss ratio is high, the increased sequentiality
helps to reduce the miss ratio. Due to the increased working set size, some benchmarks which fit
into moderately sized caches before inlining do not fit after inlining. Therefore, inlining increases
the miss ratio of moderately sized caches. For large caches, since the working sets fit in the cache
before and after inlining, the effect of inlining is insignificant.

Superscalar optimizations increase the cache size required for a given miss ratio. However,
they increase the sequentiality of instruction access so much that a simple load-forward scheme
effectively cancels the negative effects. Using load forwarding, the three types of code expanding
optimizations jointly improve the performance of small caches in spite of the substantial code
expansion. Load forwarding also allows the code expanding optimizations to have little negative
effect on the performance of large caches.